Annina Haverinen 19.11.2018
Loading the MASS “Boston” dataset already loaded in R. Link to dataset
This is a data set about Housing Values in Suburbs of Boston.
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. < em >J. Environ. Economics and Management < b >5, 81???102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) < em >Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley
Variables in the dataset:
library(MASS)
data("Boston")
dim(Boston) #506 observation of 14 variables
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Names of the variables in the data
506 observation of 14 variables
gather(Boston) %>%
ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Visualisation of the variables. the variable rm is the only one looking normally distribiuted, the others seem rather skewed.
library(tidyr)
library("ggplot2")
library("GGally")
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix
p1
?ggcorr
p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)
p2
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Looking at the correlations plot we can see that positively correlated variables are; rad (index of accesibility to radial highways), lstat (lower status of the population, percent) tax (full-value property-tax rate per $10,000) to crim (our variable of interest is crimerate per capita by town). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).
scaled_boston <- scale(Boston)
summary(scaled_boston)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
crim_vector <- quantile(scaled_boston$crim) #new crime variable
crim_vector
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8)
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404 14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102 14
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2425743 0.2549505 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.9642146 -0.9121498 -0.116404306 -0.8879627 0.4433907
## med_low -0.1499109 -0.2752347 -0.071456607 -0.5576972 -0.1614688
## med_high -0.3690158 0.1267050 0.224586496 0.3787396 0.1191614
## high -0.4872402 1.0171096 -0.002135914 1.1104597 -0.4659708
## age dis rad tax ptratio
## low -0.8545322 0.9268423 -0.6964601 -0.7358577 -0.44959809
## med_low -0.2736328 0.2806592 -0.5447506 -0.4809842 -0.06901381
## med_high 0.4066057 -0.3633938 -0.4745387 -0.3809150 -0.30234605
## high 0.8071096 -0.8593042 1.6382099 1.5141140 0.78087177
## black lstat medv
## low 0.3789899 -0.76574214 0.51263076
## med_low 0.3164544 -0.12451730 -0.01743563
## med_high 0.1320563 -0.01255628 0.21547617
## high -0.7429107 0.91525025 -0.68483433
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08840379 0.70557144 -0.917616680
## indus 0.08612616 -0.12471345 0.320382732
## chas -0.02779654 -0.04817284 0.005687493
## nox 0.40469722 -0.70717805 -1.573586440
## rm -0.04662213 -0.05333190 -0.238574268
## age 0.12086530 -0.26798321 -0.053459745
## dis -0.06726274 -0.12188315 -0.080340692
## rad 4.01529318 0.96815220 0.007205622
## tax 0.06537236 -0.06683135 0.560167160
## ptratio 0.13465626 0.03802825 -0.453311859
## black -0.07825554 0.03975242 0.110194110
## lstat 0.20234831 -0.29786104 0.239076638
## medv 0.09844462 -0.39112380 -0.240746769
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9644 0.0262 0.0094
classes <- as.numeric(train_set$crime) # 1,2,3,4 in stead of low,med_low,med_high, high
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)
##Predict with LDA
# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(correct = lda.pred$class, prediction = correct_classes)%>%addmargins()
## prediction
## correct low med_low med_high high Sum
## low 19 9 0 0 28
## med_low 7 16 4 0 27
## med_high 0 3 15 0 18
## high 0 0 4 25 29
## Sum 26 28 23 25 102
Prediction was right in 56% of test set
high crime was prdicted in 16/18 cases
med_high crime was predicted in 18/24
high crime was predicted in 26/27
library(MASS)
data("Boston")
scaled_boston <- scale(Boston)
#determining the number of clusters
wss <- (nrow(scaled_boston)-1)*sum(apply(scaled_boston,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
km<- kmeans(scaled_boston,3)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
## Length Class Mode
## cluster 506 -none- numeric
## centers 70 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
3 clusters would be appropriate.
pairs(scaled_boston, col=km$cluster)
data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506 14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.46047431 0.25296443 0.20355731 0.08300395
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3770786 -0.3361581 -0.357024 0.01492717 -0.3627826 -0.05942108
## 2 0.3860352 -0.4872402 1.177101 0.09677408 1.1367747 -0.46615004
## 3 -0.4083200 1.5646181 -1.071144 -0.04298342 -1.0195629 0.88181624
## 4 1.9167567 -0.4872402 1.020131 -0.27232907 1.0484799 -0.41225609
## age dis rad tax ptratio black
## 1 -0.0510395 0.07344745 -0.5767039 -0.6112497 -0.1132745 0.2885204
## 2 0.8113429 -0.84147223 1.0055151 1.1534294 0.5524522 0.1238290
## 3 -1.2147329 1.24283836 -0.6005355 -0.6591517 -0.7342052 0.3535227
## 4 0.7894713 -0.89088476 1.6076484 1.4922585 0.7452908 -2.8449572
## lstat medv
## 1 -0.2079635 0.07175058
## 2 0.7316707 -0.52309197
## 3 -0.9453267 0.94512754
## 4 1.2421501 -1.12167260
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.279080804 -0.79877867 -0.08683356
## zn 0.030140565 -0.33285057 1.26937453
## indus 0.824042075 0.43909568 0.42582133
## chas -0.052691582 0.02323526 -0.02531932
## nox 0.880477101 0.52650678 0.88365117
## rm -0.040434271 -0.22406650 0.36800546
## age -0.045316679 0.24162605 -0.57110107
## dis -0.047493769 -0.02490285 0.49487630
## rad 0.634634714 0.16879880 0.07074025
## tax 0.617515825 0.32440160 0.68401593
## ptratio 0.311095137 0.03084074 0.20476753
## black -0.727324754 1.84389704 0.52511602
## lstat 0.220293087 -0.14348056 0.44187829
## medv 0.002872153 -0.12636659 0.57372717
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6912 0.2028 0.1060
classes1 <- as.numeric(km_bonus$cluster)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)
##superbonus
model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda1$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)